--- title: "Final Report" date: 2020-05-06 author_profile: true tags: [R] excerpt: "Econ 490 Final Report" layout: single toc: true toc_label: "Directory" toc_sticky: true ---
Health has always been an integral part in the well-being of society and understanding this relationship is crucial to truly appreciate how far we have become. Naturally with any essential need, there is a sector that is dedicated to carry out services to maintain our health. Though like other critical industries, we have to acknowledge the monetary implications and effect on the overall economy. The CDC has stated that the cost of health expenditures per capita totaled $10,739 in 2017 with national health expenditure reaching $3.5 trillion. According to the Center for Medicare & Medicaid Services (CMS), the healthcare industry made up 17.9 percent of GDP in 2017 and is being projected to reach 19.4 percent by 2027. Amongst some of the reasons for this was the rise of “Key economic and demographic factors fundamental to the health sector” (CMS, 2017). Medical costs are also expected to rise due to upward trends in pricing of prescription drugs, and physician and clinical services. In this paper I hope to find out more about some of the effects that demographics already play in the cost of health care and see why and how these demographics might be growing in the coming years.
In this report I will be using the Insurance Dataset which is comprised from a simulation based on demographic statistics from the US Census Bureau and first appeared in the Book machine learning with R by Brett Lantz.
Insurance = read.csv("insurance.csv")
attach(Insurance)
Throughout this Paper I will be looking at medical cost data and what factors might best predict what costs are on a personal basis. The news has been dominated by stories about health care recently as more people are questioning the costs and methods in which it should be distributed. By getting a better understanding of the cost we may be able to create inferences and identify reasons why medical costs might be getting higher or affecting certain groups of people differently.
names(Insurance)
## [1] "age" "sex" "bmi" "children" "smoker" "region" "charges"
These are the variables that exist whithin the insurance data set.
The first variable will be charges, this is the dependent variable in our study and is in dollar units. This is the cost that has been billed to patients from their health care provider.
The first independent variable will be age. The age of the patient could have an effect on their medical costs since many companies will classify customers based on their age in age brackets. These different brackets could range from university health care plans to senior care and could have different costs and services included within them.
The next independent variable will be sex. This is the born gender of each patient in the study. Gender bias has been a deciding factor in other major subjects in the past so it would be interesting to see a bias in medical cost as well.
The second Variable used will be BMI or Body Mass Index. This is a comparison of one’s body mass compared to their height. The notion here is that people who are considered obese by BMI might have higher costs since their weight may lead to more health-related issues hence classifying them with higher risk then a healthy person. However, it should be noted that BMI isn’t as affective when considering muscle weight as many people who lift and are visibly healthy may be considered overweight by BMI. This variable is in units of (kg/m^2).
The next variable we will look at will be children. This is a count of how many children/ dependents are on one’s health plan. The possibility of more than one person on a health plan is expected to bring costs up since there is more risk on an account with more people associated with it.
The next variable will be smoker. This is a binary variable of yes or no showing whether the patient is a smoker. It is known that smoking can lead to numerous health issues, so companies are expected to treat these patients with more risk than those who do not smoke.
The last independent variable we will be looking at is what region the patient is from in the United States. The regions used in the data are northeast, southeast, southwest, northwest. Relative prices on other goods and services can be prone to regional effects so it would be interesting to see if medical costs follow suit. I do know that in areas like the west there are fewer medical facilities and personal available so I would assume costs would be higher due to its inelastic nature. Another note about this variable is the absence of the Midwest region (great plains). This may impact the results of each region as they all will cover more land than usual and will include a variety of political identities amongst states which might impact costs on a state basis.
summary(Insurance)
## age sex bmi children smoker
## Min. :18.00 female:662 Min. :15.96 Min. :0.000 no :1064
## 1st Qu.:27.00 male :676 1st Qu.:26.30 1st Qu.:0.000 yes: 274
## Median :39.00 Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## region charges
## northeast:324 Min. : 1122
## northwest:325 1st Qu.: 4740
## southeast:364 Median : 9382
## southwest:325 Mean :13270
## 3rd Qu.:16640
## Max. :63770
The summary statistics will serve as a good way to start looking at the potential trend of each variable in the data set. Medical Charges seem to have a mean of $13,270 while having 75% of the observation amounts being less than or equal to $16,640. The max amount is $63,370 which is a lot higher than the rest of the observations which may be attributed to the wide variety of medical conditions, all having different expense to deal with the individual problem. Sex and Region look to have equal representation in the data whereas the Smoking dummy variable has a ratio around 4/1 for nonsmokers to smokers. The majority of Patients are considered overweight when looking at BMI as well as a large obese population. It will be interesting to see if the subgroups for each variable provide insight to the disparity seen in medical charges.
ggplot(Insurance, aes(sex))+
geom_bar()+
labs(title="Barchart of Gender", x="Gender", y="Count")
For qualitative variables I will be using bar charts to help show the data. Based on the graph here we can see that the levels of men and woman are roughly the same. This is helpful in ensuring the data won’t be skewed toward one group.
ggplot(Insurance, aes(age))+
geom_histogram(aes(age),binwidth = 1, col = 'blue', size = .5)+
labs(title="Histogram of Age varibale", x="Age", y="Count")
In this histogram we can see that there is a level amount of people of different ages except when it comes to those who are younger.
ggplot(Insurance, aes(age))+
geom_histogram(aes(fill = sex),binwidth = 1, col = 'blue', size = .5)+
labs(title="Histogram of Age varibale", x="Age", y="Count")
This plot shows that gender is evenly distributed over the age ensuring that the count isn’t inherently skewed in favor of one gender.
ggplot(Insurance, aes(bmi))+
geom_histogram(aes(bmi),binwidth = 1, col = 'blue', size = .5)+
labs(title="Histogram of BMI", x="BMI", y="Count")
While looking at a histogram of BMI we can see that the data is slightly skewed toward the right and varies from the traditional bell curve. It should be noted that an optimal BMI should range from 18.5 to 25 while a range of 25 to 30 is considered overweight. It seems like most of the data is centered around the 25-30 area which is the average measurement for men in the USA. There is a good amount of those with scores above 30 indicating obese status so it will be interesting seeing if that impacts the study going forward.
ggplot(Insurance, aes(children))+
geom_histogram(aes(children),binwidth = 1, col = 'blue', size = .5)+
labs(title="Histogram of Amount of Children", x="Number of Chilren", y="Count")
In the histogram for number of children we can see that a large portion of the participants do not have any children. This corresponds with the graph of age earlier as most of the younger participants will not have kids or families for that matter. It should also be noted that the more kids a person has, the less frequent the subgroup is.
ggplot(Insurance, aes(smoker))+
geom_bar(width=0.5)+
labs(title="Barchart of Smoke Status", x="Smoke Status", y="Count")
In this bar chart we can see that there are substantially more non-smokers than there are smokers in the study. However, there is still enough smokers to create a subgroup and see if they will have an impact on their medical costs which is what I suspect them to have.
ggplot(Insurance, aes(region))+
geom_bar(width=0.8)+
labs(title="Barchart of Region", x="Region", y="Count")
Our final variable is region of each participant. Since the amount of people in each region is very close, we can get an accurate measure of the impact a region may or may not have when comparing subgroups. Regional pricing makes sense for some goods and services so well see how well this holds up for this study.
ggplot(Insurance, aes(x=as.factor(smoker), y= charges)) +
geom_boxplot()+
ggtitle("Smoking Effect on Charges")+
labs(y="Charges", x="Smoke Status")
While looking at this data I could see a clear trend amongst those who smoked. Their median levels were significantly higher than the top 25% of non-smokers. I suspect these medical charges could be a result of some portion dealing with a serious illness like throat cancer.
ggplot(Insurance, aes(x=age, y=charges))+
geom_point(aes(col = smoker), size = 1) +
geom_smooth(method='lm', color = "red")+
ggtitle("Age of Patients and Charge Amount")+
labs(y="Charge Amount", x="Age of Patients")
The second correlation plot I made was with the age variable. As we can see in the graph, there are three different price ranges that show up in the data. This can be attributed to those who smoke since they have higher prices on average. The main density of the model is in the lower end of price as the regression line is slightly above that area. Age does seem to have an impact as prices increase across all charge levels with age. This might be from the cost of prescription drugs and surgeries that older people are more likely to receive. Though it seems like smoking plays a bigger role in determining the price area.
ggplot(Insurance, aes(x=as.factor(sex), y= charges)) +
geom_boxplot()+
labs(x="Gender", y="Charges")+
ggtitle("Gender of Patients and Charges")
As we can see in the box plot graph, the variance in the data is more dispersed in males then in females. While the medians are similar, it seems like the upper 25% of men have higher charges than the top 25% of women. However, is should also be noted that the bottom 25% of both men and women are very similar. This could be families on the same plan or maybe the student health plans of some of the younger participants.
ggplot(Insurance, aes(x=bmi, y=charges))+
geom_point(col ='blue', size = 1) +
geom_smooth(method='lm', color = "red")+
ggtitle("BMI and Charge Amount")+
labs(y="Charge Amount", x="BMI of Patients")
In this chart we can see the relationship between BMI and medical charges. It seems like there are two things happening in this chart. We can see that for those in the lower range of prices, BMI doesn’t seem to have a significant effect. However, a trend can be seen in the medium to high price ranges where the BMI does correspond to higher costs. The data is pretty noisy so significance might be relied more to other variables. Given the big split I suggest that making subgroups might give some more insight why the graph looks like this.
ggplot(Insurance, aes(x=bmi, y=charges))+
geom_point(aes(col = smoker), size = 1) +
geom_smooth(method='lm', color = "red")+
ggtitle("BMI and Charge Amount")+
labs(y="Charge Amount", x="BMI of Patients")
When accounting for smoking status it is clear that there is a trend with smoking and the occurrence of higher prices. BMI still has a slight effect on charges, but it doesn’t match the magnitude as the smoking variable.
ggplot(Insurance, aes(x=as.factor(children), y= charges)) +
geom_boxplot()+
labs(y="Charges", x="Number of Children")+
ggtitle("Number of Children and Charges")
After looking at the relationship of charges and number of children it can be inferred that there is not a big impact on overall medical cost. There is an upward trend while comparing people with no kids up until people with 3 kids but then it starts to sip down. I know there were few people in the study who were in the groups with 4+ children so maybe adding more would yield a different result but so far there isn’t any evidence of this being a deciding factor.
ggplot(Insurance, aes(x=as.factor(region), y= charges)) +
geom_boxplot()+
ggtitle("Medical Charges by Region")+
labs(y="Charges", x="Region")
The final variable that we will be looking at is region. Based on the graph we can see that there is not a huge difference amongst those in different regions. The only thing to note is that the top 25% of those in the southeast region may have higher medical costs. My initial hypothesis for this is that there are more older people in that part of the country, especially in Florida. As we saw earlier with age, older people tend to have higher medical costs.
In this part, I will run some regressions and pick a model using forward selection.
lm1= lm(charges~smoker+bmi)
lm2=lm(charges~smoker+bmi+age)
lm3 =lm(charges~smoker+bmi+age+children)
lm4=lm(charges~smoker+bmi+age+children+region)
lm5=lm(charges~smoker+bmi+age+children+region+sex)
It makes sense that most of the variables yielded a positive relationship with medical charges as intuition would suggest that variables like growing older and being overweight might play a big role in determining costs. Smoking having the largest effect does make sense since there are numerous known risks associated with smoking and we know it could lead to lung cancer which would bring medical cost to a very high amount for some. It was interesting to see that sex did not play a big role in medical costs as this is the variable that will be dropped from the multiple regression. This is good as a gender bias may not be a big factor in these prices. When comparing the Adjusted R^2 values we can that the value grows within each regression until the end. This means that the best regression model is the lm4 model with an adjusted R^2 value of 0.7496.
summary(lm4)
##
## Call:
## lm(formula = charges ~ smoker + bmi + age + children + region)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11367.2 -2835.4 -979.7 1361.9 29935.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11990.27 978.76 -12.250 < 2e-16 ***
## smokeryes 23836.30 411.86 57.875 < 2e-16 ***
## bmi 338.66 28.56 11.858 < 2e-16 ***
## age 256.97 11.89 21.610 < 2e-16 ***
## children 474.57 137.74 3.445 0.000588 ***
## regionnorthwest -352.18 476.12 -0.740 0.459618
## regionsoutheast -1034.36 478.54 -2.162 0.030834 *
## regionsouthwest -959.37 477.78 -2.008 0.044846 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6060 on 1330 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7496
## F-statistic: 572.7 on 7 and 1330 DF, p-value: < 2.2e-16
The Best Regression was the fourth regression which included the following variables: smoker, bmi, age, children, and region. All effects on medical charges are expressed in US dollars.
B0 (aka nothing but being in Northeast region) = -$11,990.27.
B1 = $23,836.30, this is the effect if someone is a smoker.
B2= $338.66, positive effect on charges if there is a one unit increase of bmi.
B3= $256.97, positive effect on charges if there is a one unit increase of age.
B4= $474.57, positive effect on charges if an individual has one more child in family.
regionnorthwest= -352.18, This dummy variable indicates that people in the Northwest region have charges $352.18 less than those in the Northeast. regionsoutheast= -1,034.36, This dummy variable indicates that people in the Southeast region pay $1,034.36 less on average than those in the Northeast. regionsouthwest= -959.37, this dummy variable indicates that people in the Southwest pay $959.37 less on average than people in the Northeast.
All of the coefficients seem to be significantly different from zero on a 5% level with the exception of the Northwest region dummy variable. F-statistic: 572.7 on 7 and 1330 DF. This F stat is very far from zero so we can reject the null hypothesis that the model doesn’t have any significant variables in it.
ggplot() +
geom_point(aes(x = Insurance$charges, y = predict(lm4)),
colour = 'blue') +
geom_line(aes(x = Insurance$charges, y = Insurance$charges ),
colour = 'red') +
ggtitle('Predicted Medical Charges vs Actual Medical Charges') +
xlab('Actual y') + ylab('Predicted y')
This plot shows that the predictions in our model are fairly accurate in the cases where charges are low. We can see that variance doesn’t change too much toward the middle of the plot but then increases at the end. The model isn’t too accurate for predicting those with larger medical costs.
While looking at these coefficients we can look at the standard errors to see how far off the values are.
lm.summary = summary(lm4)
lm.summary$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11990.2699 978.76153 -12.2504507 9.057930e-33
## smokeryes 23836.3005 411.85645 57.8752633 0.000000e+00
## bmi 338.6646 28.55895 11.8584404 6.601498e-31
## age 256.9736 11.89136 21.6101102 5.235999e-89
## children 474.5665 137.73999 3.4453790 5.880165e-04
## regionnorthwest -352.1821 476.12044 -0.7396912 4.596179e-01
## regionsoutheast -1034.3601 478.53728 -2.1615038 3.083440e-02
## regionsouthwest -959.3747 477.77820 -2.0079917 4.484608e-02
coefs = as.data.frame(lm.summary$coefficients[-1,1:2]) # -1 is to exclude the intercept
names(coefs)[2] = "se"
coefs$vars = rownames(coefs)
ggplot(coefs, aes(vars, Estimate)) +
geom_errorbar(aes(ymin=Estimate - 1.96*se, ymax=Estimate + 1.96*se), lwd=1, colour="red", width=0) +
geom_errorbar(aes(ymin=Estimate - se, ymax=Estimate + se), lwd=1.5, colour="blue", width=0) +
geom_point(size=2, pch=21, fill="yellow")+
ggtitle('Estimated Coefficients and Standard Error')
Smoking does have a rather large effect on the data, and its standard errors are quite different so we can take a look at the other variables to get a better look.
#coefs with smoking removed
coefs = as.data.frame(lm.summary$coefficients[c(-1,-2),1:2]) # -1 is to exclude the intercept
names(coefs)[2] = "se"
coefs$vars = rownames(coefs)
ggplot(coefs, aes(vars, Estimate)) +
geom_errorbar(aes(ymin=Estimate - 1.96*se, ymax=Estimate + 1.96*se), lwd=1, colour="red", width=0) +
geom_errorbar(aes(ymin=Estimate - se, ymax=Estimate + se), lwd=1.5, colour="blue", width=0) +
geom_point(size=2, pch=21, fill="yellow")+
ggtitle('Estimated Coefficients and Standard Error Without Smoking')
We can see that the regional variables seem to have more variance than the other variables and can be looked at as not that significant when taking a look at the p value of the lm4 model, especially for the dummy variable for the northwest region which fails to fall under the 0.05 threshold.
ggplot(Insurance, aes(x=charges, y= lm4$residuals))+
xlab('Charges')+
ylab('Residuals')+
geom_point(aes(x=charges, y= lm4$residuals))+
geom_smooth(method='lm',se=F)+
ggtitle("Residuals Against Actual Charges")
ggplot(Insurance) +
labs(y="Frequency", x="Residuals") +
geom_histogram(aes(x=residuals(lm4)),binwidth = 500, colour='grey')+
ggtitle("Frequency of Estimated Y Residual Values")
After looking at the residual plot we can see that most of the residual follow a normal pattern being off in the range 0 to 5000. This is good as it reflects our finding in the previous graph with yhat and actual y’s. The lower level of charges and bulk of the data is fairly accurate with the accuracy swaying a bit after the actual charges increase.
set.seed(1)
train = Insurance %>% sample_frac(.7)
test = Insurance %>% setdiff(train)
x_train = model.matrix(charges~., train)[,-1]
x_test = model.matrix(charges~., test)[,-1]
y_train = train$charges
y_test = test$charges
#for part 2-d):
x = model.matrix(charges~., Insurance)[,-1] # trim off the first column
# leaving only the predictors
y = Insurance$charges
For this report I will be splitting up the Insurance Dataset into train and test data. Both Sets of data will be made into matrixes split by subset and based on variable status (Dependent or Independent) and will be used for the ridge and lasso regressions.
Lcv.out = cv.glmnet(x_train, y_train, alpha = 1) # Fit Lasso regression model on training data
This is the Lasso model so the alpha will be set to 1. We use cross validation, cv.glmnet function, to find the optimal values of Lambda.
Lbestlam1se = Lcv.out$lambda.1se
Lbestlam1se
## [1] 841.438
By using the Lcv$lambda.1se command I found the one standard error apart lambda which was 841.438. This is the lambda that I will be using going forward.
plot(Lcv.out)
log(Lbestlam1se) # This is the selected l for the Lasso model
## [1] 6.735112
title("Mean-Squared Error by Log(Lambda)")
In the graph the pattern of MSE of various lambdas can be seen. Because the lambdas are large, it is better to show them in log form. The second dotted line corresponds with the 1SE lambda which has a log value of 6.735112.
Lasso_out = glmnet(x, y, alpha = 1)
plot(Lasso_out, xvar = "lambda")
title(main = "Coefficient Level by Log(Lambda)")
The graph shows how the lambdas have big impact on condensing the coefficients of the regression and tends to shrink the values toward zero as the lambda is increased.
# training model on the training set
Lasso_mod = glmnet(x_train, y_train, alpha=1, lambda = Lbestlam1se)
# prediction using the trained model and x_test set
Lasso_pred = predict(Lasso_mod, s = Lbestlam1se, newx = x_test)
predict(Lasso_mod, type = "coefficients", s = Lbestlam1se)[1:9,]
## (Intercept) age sexmale bmi children
## -5279.1994 211.3543 0.0000 183.6052 0.0000
## smokeryes regionnorthwest regionsoutheast regionsouthwest
## 21528.1699 0.0000 0.0000 0.0000
After running the Lasso model, we can see some stark differences in the coefficient amounts compared to the Ridge and multiple regression models. The intercept is now set to -$5,279.1994 for the Lasso model as opposed to the starting point of -$8318.0048 for the Ridge. For the independent variables, there is also some variation. Age and BMI still have positive effects on charges of $211.3543 and $183.6052 respectively on a unit basis. But the coefficients for children, and the region dummy variables are all near 0. Which means that those variables don’t have much impact according to the Lasso model. In this model, Smoking still has the biggest impact with a value of $21,528.1699 higher than those who do not smoke.
mean((Lasso_pred - y_test)^2)
## [1] 48166162
When using the Lasso model on the test data the MSE was $48,166,162.
When comparing the MSE’s, the Lasso model does not compete well against the lm4 or lm5 models.
cv.out = cv.glmnet(x_train, y_train, alpha = 0) # Fit ridge regression model on training data
This is the Ridge model so the alpha will be set to 0. We use cross validation, cv.glmnet function, to find the optimal values of Lambda.
bestlam1se = cv.out$lambda.1se
bestlam1se
## [1] 2183.552
By using the cv$lambda.1se command I found the one standard error apart lambda which was 2183.552. This is the lambda that I will be using going forward.
plot(cv.out)
log(bestlam1se) # This is the selected l for the Ridge model
## [1] 7.688708
title("Mean-Squared Error by Log(Lambda)")
In the graph the pattern of MSE of various lambdas can be seen. Because the lambdas are large, it is better to show them in log form. The dotted line corresponds with the 1SE lambda which has a log value of 7.688708.
ridge_out = glmnet(x, y, alpha = 0)
plot(ridge_out, xvar = "lambda") #shows coeficcients going to zero
title("Coefficient Level by Log(Lambda)")
The graph shows how the lambdas have big impact on condensing the coefficients of the regression and tend to shrink the values toward zero as the lambda is increased.
# training model on the training set
ridge_mod = glmnet(x_train, y_train, alpha=0, lambda = bestlam1se)
# prediction using the trained model and x_test set
ridge_pred = predict(ridge_mod, s = bestlam1se, newx = x_test)
predict(ridge_mod, type = "coefficients", s = bestlam1se)[1:9,]
## (Intercept) age sexmale bmi children
## -8318.0048 223.9518 434.0982 266.4288 355.2956
## smokeryes regionnorthwest regionsoutheast regionsouthwest
## 19931.4594 -260.0873 -378.8033 -438.7921
Based on the coefficients in the ridge model, smoking is still having the largest impact on results. If one is a smoker than their result will be $19,931.4594 higher than otherwise. Age still has a positive effect with an increase to charges of $223.9518 per year. Being a male increases medical charges by $434.0982 compared to females. BMI has a positive effect of $266.4288 per unit. The number of children still carries a positive relationship of $355.2956 per child. All of the region dummy variables are still negative. This means that people in the Southwest, Southeast, and South west have average lower charges of -$260.0873, -$378.8033, -$438.7921 dollars respectively compared to those in the Northeast region.
MSER = mean((ridge_pred - y_test)^2)
MSER
## [1] 48454842
When using the Ridge model on the test data the MSE was $48,454,842.
In this case the Ridge did not do better than the lm4 test when comparing MSE since $44,955,088 < $48,454,842. Ridge also doesn’t perform better than the Lasso model for this data set.
library(tree)
set.seed(1)
Insurance_train = Insurance %>% sample_frac(.7)
Insurance_test = Insurance %>% setdiff(Insurance_train)
tree_Insurance=tree(charges~., Insurance_train)
summary(tree_Insurance)
##
## Regression tree:
## tree(formula = charges ~ ., data = Insurance_train)
## Variables actually used in tree construction:
## [1] "smoker" "age" "bmi"
## Number of terminal nodes: 5
## Residual mean deviance: 20610000 = 1.921e+10 / 932
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8565.0 -2872.0 -852.7 0.0 1068.0 23060.0
tree_Insurance
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 937 1.329e+11 13030
## 2) smoker: no 746 2.431e+10 8251
## 4) age < 44.5 447 8.391e+09 5414 *
## 5) age > 44.5 299 6.940e+09 12490 *
## 3) smoker: yes 191 2.487e+10 31710
## 6) bmi < 30.1 93 2.564e+09 21390 *
## 7) bmi > 30.1 98 3.003e+09 41510
## 14) age < 41.5 51 4.625e+08 37520 *
## 15) age > 41.5 47 8.541e+08 45830 *
I split the data once again but in a 70/30 split for the regression tree. I then fit a tree that ran on each independent variable in search for prediction of Medical charges.
plot(tree_Insurance)
text(tree_Insurance, pretty = 0)
title("Medical Charges Regression Tree")
Here is the labeled plot of the fitted tree.
single_tree_estimate = predict(tree_Insurance, newdata = Insurance_test)
mean((single_tree_estimate - Insurance_test$charges)^2)
## [1] 32772368
When using the fitted tree to predict Charges in the test subset I received an MSE of $32,772,368.
cv.Insurance = cv.tree(tree_Insurance)
plot(cv.Insurance$size, cv.Insurance$dev, type = 'b')
title("Deviance of Tree by Size")
prune_Insurance = prune.tree(tree_Insurance, best = 5)
cv.Insurance$size[(match(min(cv.Insurance$dev), cv.Insurance$dev))] # number of optimal terminal nodes
## [1] 5
In order to get a true accurate representation in a tree form we need to prune the tree with cross validation. This was done with the cv.tree function, and the deviation associated with each tree size shows that the deviation did drop as the tree grew. Thus, the pruned tree is same original tree that had 5 terminal nodes.
plot(prune_Insurance)
text(prune_Insurance, pretty = 0)
title("Medical Charges Regression Tree")
Here is the plotted Tree after pruning, which is just the original tree since the best tree had 5 terminal nodes.
When looking at the tree it starts out with the largest determent of cost which is smoking. This initial split is expected as results of this dummy proved massive in all of the previous models. Of those that are not smokers, the next largest factor is age. The split here is set to 44.5, meaning that those non-smokers who are less than 44 years old will have an average medical cost of $5,414. Those who are older than 45 have an average price of $12,490 dollars. These results are not too surprising as age data suggested that younger people had lower medical costs since they tend to be healthier. On the side of three for those who do smoke, the next biggest factor is BMI. The split here set to 30.1 and those who have a BMI lower than that will have an average cost $21,390. This makes sense as 30 is seen as the threshold for classifying individual’s as obese. Obesity is also known to be linked with a myriad of health problems, so this is the likely cause for higher costs. For those who have a BMI greater than 30.1, the last critical variable is age. Like for non-smokers, age has a large impact on the health of individuals. The split here is at 41.5 which is lower than the split for non-smokers. For those who are under 42, the average cost for medical expenses are $37,520 and $45,830 for those who are over 42 years old. These results are not surprising considering the large impact smoking can have on one’s health as well as the added medical concerns for those who are considered obese.
After comparing all the test MSE’s, it shows that the Pruned tree did in fact do the best in terms of error reduction for this test set with an MSE of $32,772,368. I believe this type of regression does a better job of showing how variables such as smoking, BMI, and Age truly affect the charges people receive for medical treatment.
library(boot)
alpha.fn = function(data,index){
boot_train = data[index,]
boot_tree = tree(charges~., boot_train)
boot_prune = prune.tree(boot_tree, best =5)
return(predict(boot_prune,Insurance_test))
}
set.seed(1)
bs_tree = boot(Insurance_train,alpha.fn,R=100)
bs_pred = colMeans(bs_tree$t)
boot_mse = mean((bs_pred - Insurance_test$charges)^2)
The code above runs a bootstrap with 100 trees of 5 terminal nodes. I chose this since it was the optimal number of nodes in the regular decision tree after cross validation.
The MSE from the bootstrap model shows a slight improvement over the regular decision tree. With an MSE of $31,309,380 it is also better than all of the regression and shrinkage models.
The test and train data from previous sections will continue to be used to get a consistent measure for MSE and effectiveness of each model.
set.seed(1)
bag.Insurance = randomForest(charges ~., data=Insurance_train,
mtry=ncol(Insurance_train)-1,
importance=TRUE)
bag.Insurance
##
## Call:
## randomForest(formula = charges ~ ., data = Insurance_train, mtry = ncol(Insurance_train) - 1, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 6
##
## Mean of squared residuals: 19030852
## % Var explained: 86.58
plot(bag.Insurance)
The Plot shows that the out of bag error significantly drops as the trees moves from 0 to 50. From this point the out of bag error remains relatively the same up until the default 500 trees. This drop in out of bag error happens because of the randomization of the samples each tree goes through that makes the model performs better.
yhat.bag = predict(bag.Insurance,
newdata = Insurance_test)
ggplot() +
geom_point(aes(x = Insurance_test$charges, y = yhat.bag)) +
geom_abline(col = 'red')+
labs(x="Medical Charge Amount", y="Bagging Estimation of Medical Charges")+
ggtitle('Bagging Estimation Against Actual Charges')
Based on the bagging model, the overall accuracy does improve. For the most part, the charges do end up matching the actual amount toward the low end and a bit in the upper end. However, there are a good number of underestimates by the model.
While looking at the MSE’s it looks clear that the bagging model is better than ridge and lasso models based on MSE. Bagging $31,593,314 < Lasso $48,166,162 < Ridge $48,454,842 dollars.
importance(bag.Insurance)
## %IncMSE IncNodePurity
## age 122.066092 16927513212
## sex -3.783892 396333573
## bmi 122.423795 25481772138
## children 19.953585 2155214632
## smoker 203.719587 83399804007
## region 8.760776 1604923147
#gives plot
varImpPlot(bag.Insurance)
In the Bagging Model, the most important variables were Smoking, BMI, and Age. This is consistent with the regression tree model as these were the variables used to distinguish most of the output regions. These are also variables that are considered crucial determinants to one’s health.
set.seed(1)
oob.err<-double(6)
test.err<-double(6)
for(mtry in 1:6)
{
rf=randomForest(charges ~ . , data = Insurance_train, mtry=mtry, ntree=500)
oob.err[mtry] = rf$mse[500] #Error of all Trees fitted on training
pred<-predict(rf,Insurance_test) #Predictions on Test Set for each Tree
test.err[mtry]= with(Insurance_test, mean( (charges - pred)^2)) # "Test" Mean Squared Error
}
round(oob.err,2) # out of bag error
## [1] 42013439 19345867 18298978 18571943 18723498 19209503
matplot(1:mtry , oob.err, pch=20 , col= "red",type="b",ylab="Mean Squared Error",xlab="Number of Predictors Considered at each Split")
legend("topright",legend= "Out of Bag Error",pch=19 ,col= "red")
The plot shows a significant drop in out of bag error when the number of variables goes past one which is suspected but the error does seem to plateau afterwards.
Based on the out of bag errors I will choose the model with a mtry of 3 since it yields the lowest out of bag error compared to values.
set.seed(1)
rf.Insurance = randomForest(charges~.,
data = Insurance_train,
mtry = 3,
importance = TRUE,
do.trace = 100) #do.trace gives you the OOB MSE for every 100 trees
## | Out-of-bag |
## Tree | MSE %Var(y) |
## 100 | 1.88e+07 13.26 |
## 200 | 1.844e+07 13.00 |
## 300 | 1.834e+07 12.93 |
## 400 | 1.832e+07 12.92 |
## 500 | 1.835e+07 12.94 |
yhat.rf = predict(rf.Insurance, newdata = Insurance_test)
#plot y vs actual y
ggplot() +
geom_point(aes(x = Insurance_test$charges, y = yhat.rf)) +
geom_abline(col ='red')+
labs(x="Medical Charge Amount", y="Random Forest Estimation of Medical Charges")+
ggtitle('Random Forest Estimation Against Actual Medical Charges')
The results are similar to the graph of the bag predictions where most of the charges are matched up pretty well to the predictions with a good amount of prediction underestimates.
With an MSE of $29,310,477 dollars, the random forest is an improvement over the other models. This can be attributed to the randomizing of variables used on splits which can further identify what variables are worth more.
importance(rf.Insurance)
## %IncMSE IncNodePurity
## age 100.700245 18530616288
## sex -2.731046 824290225
## bmi 95.281838 21028732072
## children 17.763188 2559348787
## smoker 193.575538 83075564664
## region 7.016027 2344957753
varImpPlot(rf.Insurance)
Similar to the results of the plot of the bagging test, the variables for Smoking, age, and BMI are still the most important variables due to their indications of one’s health.
round(test.err ,2) # test error
## [1] 56240065 31017858 29247440 29766505 30725974 31617846
matplot(1:mtry , cbind(oob.err,test.err), pch=20 , col=c("red","blue"),type="b",ylab="Mean Squared Error",xlab="Number of Predictors Considered at each Split")
legend("topright",legend=c("Out of Bag Error","Test Error"),pch=19, col=c("red","blue"))
While comparing the effect on Test error with the out of bag error we can see that choosing the lowest out of bag error will lead to the best performing model as performance doesn’t improve after over-fitting.
library(gbm)
set.seed(1)
boost.Insurance = gbm(charges~.,
data = Insurance_train,
distribution = "gaussian",
n.trees = 5000,
interaction.depth = 4)
#make prediction
yhat.boost = predict(boost.Insurance,
newdata = Insurance_test,
n.trees = 5000)
ggplot() +
geom_point(aes(x = Insurance_test$charges, y = yhat.boost)) +
geom_abline(col = 'red')+
labs(x="Medical Charge Amount", y="Boosting Estimation of Medical Charges")+
ggtitle('Boosting Estimation vs Actual Charges')
The model’s prediction is fairly good toward the bottom left like the previous models. It does still underestimate some cases, but at this point I would classify those points as potential special situations. However, it does overestimate more than some of the previous models making it less accurate than the random forest and Bagging methods.
With an MSE of $37,401,104 for the boosting model didn’t do too well compared to the Tree, Bagging and Random Forrest models but it was an improvement over the multiple regressions, Ridge and Lasso models. This improvement can be explained by the model’s estimation of error as more observations are added to the model.
summary(boost.Insurance)
Then importance matrix further exemplifies the point that Smoking is the largest determinant of medical charges. But in this model the BMI has a bit more significance than age when compared to the previous models.
# create hyperparameter grid
hyper_grid <- expand.grid(
shrinkage = c(.0001, .001, .01),
interaction.depth = c(2, 4, 6),
bag.fraction = c(.65, .8, 1), #subsampling with stochastic gradient descent
optimal_trees = 0, # a place to dump results
min_MSE = 0 # a place to dump results
)
# total number of combinations
nrow(hyper_grid)
## [1] 27
# grid search
for(i in 1:nrow(hyper_grid)) {
# reproducibility
set.seed(1)
# train model
gbm.cv <- gbm(
formula = charges ~ .,
distribution = "gaussian",
data = Insurance_train,
n.trees = 5000,
interaction.depth = hyper_grid$interaction.depth[i],
shrinkage = hyper_grid$shrinkage[i],
bag.fraction = hyper_grid$bag.fraction[i],
n.cores = NULL, # will use all cores by default
verbose = FALSE,
cv.folds = 5
)
# add min training error and trees to grid
hyper_grid$optimal_trees[i] <- which.min(gbm.cv$cv.error)
hyper_grid$min_MSE[i] <- min(gbm.cv$cv.error)
}
hyper_grid %>%
dplyr::arrange(min_MSE) %>%
head(10)
For the grid, I choose to use 3 different values for shrinkage, interaction depth, and bag.fraction. I used a for loop to run through the 27 combinations of parameters and then ran a boost model for each. The top 10 performers after a 5-fold cross validation are shown above. The parameter measures that I used for the optimal model are: Shrinkage = 0.01, interaction depth = 6, bag.fraction = 0.80 , optimal trees = 402.
set.seed(1)
yhat.gbm <- gbm(charges~., data = Insurance_train,
distribution = "gaussian",
n.trees=402,
interaction.depth=6,
shrinkage = 0.01,
bag.fraction = 0.80,
verbose=F)
yhat.gbm2 <- predict(yhat.gbm, Insurance_test, n.trees = 402)
ggplot() +
geom_point(aes(x = Insurance_test$charges, y = yhat.gbm2)) +
geom_abline(col = 'red')+
labs(x="Medical Charge Amount", y="New Boosting Estimation of Medical Charges")+
ggtitle('Tuned Boosting Estimation vs Actual Charges')
The model did a good job matching the bulk of the real medical charges but does suffer from the same issue of underestimates on some of the observations. The model did significantly close the gap between the prediction and real value which will be shown by the MSE.
The New boosting model gave an MSE of $27,785,368 which is a considerably improvement over the original Boost model which was $37,401,104 dollars. The process proved its worth and could be improved further with more iterations of the grid search.
summary(yhat.gbm)
The Importance matrix is very close to the original boost model and continues the trend of smoking carrying three times more importance then BMI and age.
In order to use the XGBoost model, I had to turn my categorial variables into individual binary dummy variables which are now in the data frame “df”.
set.seed(1)
Insurance_train2 = df %>%
sample_frac(.7)
Insurance_test2 = df %>%
setdiff(Insurance_train2)
Y_train2 <- as.matrix(Insurance_train2[,"charges"])
X_train2 <- as.matrix(Insurance_train2[!names(Insurance_train2) %in% c("charges")])
dtrain2 <- xgb.DMatrix(data = X_train2, label = Y_train2)
X_test2 <- as.matrix(Insurance_test2[!names(Insurance_train2) %in% c("charges")])
Y_test2 <- as.matrix(Insurance_test2[,"charges"])
dtest2 <- xgb.DMatrix(data = X_test2, label = Y_test2)
The data is then split into test and train data in matrix form.
set.seed(1)
Insurance.xgb = xgboost(data=dtrain2,
max_depth=6,
eta = 0.3,
nrounds=40, # max number of boosting iterations (trees)
lambda=1,
print_every_n = 10,
objective="reg:linear")
## [1] train-rmse:12783.659180
## [11] train-rmse:2984.226807
## [21] train-rmse:2386.794189
## [31] train-rmse:1858.854004
## [40] train-rmse:1458.274170
yhat.xgb <- predict(Insurance.xgb,X_test2)
#Part a
ggplot() +
geom_point(aes(x = Insurance_test2$charges, y = yhat.xgb)) +
geom_abline(col = 'red')+
labs(x="Medical Charge Amount", y="XGBoost Estimation of Medical Charges")+
ggtitle('XGBoost Estimation vs Actual Charges')
The plot looks very similar to the others as it contains some under estimations in the lower and high price ranges. Other than that, the model slightly overestimates the values for medical charges.
The XGBoost model improves the process of estimating a regression model with increased weight on some parameters. As a result, the model yields a smaller MSE then the multiple regression, and shrinkage models so far with a value of $33,378,976.
importance <- xgb.importance(colnames(X_train2),model=Insurance.xgb)
importance
xgb.plot.importance(importance, rel_to_first=TRUE, xlab="Relative Importance")
We can see that the smoking, age, and BMI variables were the most important determinants of outcome by a large margin. Though within this lead, Smoking is far ahead in its importance to the models outcome.
# create hyperparameter grid
hyper_grid2 <- expand.grid(
eta = c(0.5,0.3,0.01,0.001),
colsample_bytree = c(0.4,0.6,0.8,1.0),
max_depth = c(2,3,4,5,6),
#gamma = 1, #Minimum loss reduction required to make a further partition on a leaf node of the tree
#min_child_weight = 1,
lambda = 0, # coef of the L2 regularization
optimal_trees = 0, # a place to dump results
min_RMSE = 0 # a place to dump results
)
nrow(hyper_grid2)
## [1] 80
The grid search that I will be using will consist of 80 combinations of parameters.
#set up cross validation for paramters
for(i in 1:nrow(hyper_grid2)) {
set.seed(1)
cv.nround = 200
cv.nfold = 3
params = list(
eta = hyper_grid2$eta[i],
#subsample = 1.0,
colsample_bytree = hyper_grid2$colsample_bytree[i],
max_depth = hyper_grid2$max_depth[i],
lambda = 0
)
# train model
Insurance.xgb.cv <- xgb.cv(param=params, data = dtrain2,
nfold = cv.nfold,
nrounds=cv.nround,
early_stopping_rounds = 20, # training will stop if performance doesn't improve
#for 20 rounds from the last best iteration
verbos = 0 #do not show the iterations' results
)
# add min training error and trees to grid
hyper_grid2$optimal_trees[i] <- Insurance.xgb.cv$best_iteration
hyper_grid2$min_RMSE[i] <- Insurance.xgb.cv$evaluation_log$test_rmse_mean[Insurance.xgb.cv$best_iteration]
}
#show which parameters did best
hyper_grid2 %>%
dplyr::arrange(min_RMSE) %>%
head(10)
The best xgboost parameters are eta = 0.3, colsample = 0.8, maxdepth = 2, number of trees = 33
params = list(
eta = 0.3,
colsample_bytree = 0.8,
max_depth = 2
)
watchlist <- list(train = dtrain2, test = dtest2)
set.seed(1)
Insurance.xgb2 = xgb.train(param = params, data=dtrain2,
nrounds = 33,
print_every_n = 10,
watchlist = watchlist)
## [1] train-rmse:13031.306641 test-rmse:14020.828125
## [11] train-rmse:4158.267578 test-rmse:5381.097168
## [21] train-rmse:3956.725586 test-rmse:5253.666016
## [31] train-rmse:3878.510254 test-rmse:5246.597168
## [33] train-rmse:3866.323975 test-rmse:5250.906738
yhat.xgb2 <- predict(Insurance.xgb2,X_test2)
Tune_XGBoost = round(mean((yhat.xgb2 - Insurance_test2$charges)^2),2)
The MSE for the tuned model yields an MSE of $27,572,014. This is a vast improvement over the original XGBoost model as well as edging out the tuned boost model.
library(keras)
library(ISLR)
library(dplyr)
library(MASS)
library(tensorflow)
use_condaenv("r-tensorflow")
To begin, I must scale the train and test data used before.
set.seed(1)
Insurance_train2 = df %>%
sample_frac(.7)
Insurance_test2 = df %>%
setdiff(Insurance_train2)
Y_train2 <- as.matrix(Insurance_train2[,"charges"])
X_train2 <- as.matrix(Insurance_train2[!names(Insurance_train2) %in% c("charges")])
X_test2 <- as.matrix(Insurance_test2[!names(Insurance_train2) %in% c("charges")])
Y_test2 <- as.matrix(Insurance_test2[,"charges"])
train_data <-scale(X_train2)
test_data <- scale(X_test2)
set.seed(1)
model <- keras_model_sequential() %>%
layer_dense(units = 64, activation = "relu",
input_shape = dim(train_data)[2]) %>%
layer_dense(units = 64, activation = "relu") %>%
layer_dense(units = 1)
The Neural Network will have two hidden layers of 64 nodes.
set.seed(1)
model %>% compile(
loss = "mse",
optimizer = optimizer_rmsprop(),
metrics = list("mean_absolute_error")
)
Compiled the model and defined the loss function to be measured in MSE.
early_stop <- callback_early_stopping(monitor = "val_loss", patience = 10)
The model will end early if there isn’t improvement after 10 epochs or iterations of the model.
set.seed(1)
Fit_Model <- model %>% fit(
train_data,
Y_train2,
epochs = 250,
validation_split = 0.2,
callbacks = list(early_stop),
verbose = 0
)
To the final fitted model will have maximum epochs set at 250. The validation split for train data is also set at 20%.
plot(Fit_Model, metrics = "mean_absolute_error", smooth = FALSE)
The plot shows that mean absolute error drops significantly as the number of epochs increases up until after 50 when the loss in mean absolute error slows down.
set.seed(1)
test_predictions <- model %>% predict(test_data)
Neural_Net_MSE = mean((Y_test2 - test_predictions)^2)
Neural_Net_MSE # $39,521,827
After running the model on the test data, we get an MSE of $39,521,827. This value is slightly worse than the original boosting model while still outperforming the multiple regression and shrinkage methods.
library(caret)
library(FNN)
library(dplyr)
set.seed(1)
data <- df
scale(data[,c("age","bmi")]) # scale age and bmi, every other variable is a dummy variable
#split into test and train data
set.seed(1)
K_train = data %>%
sample_frac(.7)
K_test = data %>%
setdiff(K_train)
#KX_train =model.matrix(charges~., K_train)[,-1]
#KX_test =model.matrix(charges~., K_test)[,-1]
#KY_train = K_train$charges
#KY_test = K_test$charges
For the KNN model, I had to scale down the age and BMI variables due to them being represented by different metrics. Since the data is from the “df” dataset, all other variables should be represented as a binomial of 0 or 1.
library(Metrics)
krmse = function(actual, predicted){
sqrt(mean((actual - predicted)^2))
}
make_knn_pred = function(k = 1, training, predicting) {
pred = FNN::knn.reg(train = training[, -1],
test = predicting[, -1],
y = training$charges, k = k)$pred
act = predicting$charges
rmse(predicted = pred, actual = act)
}
k = seq(1,10,1)
Here I created a function to make a prediction based on the KNN model as well as finding the root MSE of the prediction function. To fins the optimal “K” for the model will be testing k’s from 1 to 10.
knn_tst_rmse = sapply(k, make_knn_pred, training = K_train,
predicting = K_test)
knn_results = data.frame(k,round(knn_tst_rmse,2))
colnames(knn_results) = c("k", "Test MSE")
ggplot(knn_results, aes(x = k, y = `Test MSE`)) +
geom_line(color="orange") +
geom_point(colour = 'blue') +
theme(panel.background = element_blank(), plot.title = element_text(hjust = 0.4)) +
ggtitle('Number of K with RMSE') +
xlab('K') + ylab('RMSE') +
geom_vline(xintercept = k[which.min(knn_tst_rmse)], linetype="dashed",
size = 0.5) +
annotate(geom="text",x=1, y=1000, label="k = 1", color = "red")
Here we can see that the optimal amount of K’s is actually 1. A KNN model testing on one neighbor gives the lowest root MSE and will be the parameter in the model I use to compare with the prior models.
#Square root of MSE for predicting the entire data using KNN model
knn_y = FNN::knn.reg(train = K_train[, -1],
test = K_test[, -1],
y = K_train$charges, k = 1)$pred
K1_RMSE = rmse(predicted = knn_y, actual = K_test$charges)
KNN_MSE = K1_RMSE^2
KNN_MSE
## [1] 32568.05
The MSE for a KNN model with 1 neighbor is $32,568.05. This value is by far the best result in the study. I believe this model performed very well since every other model has some trouble distinguishing smokers from the rest of the population. In the KNN model, since the smoker’s values are clustered near each other, it is more likely that the nearest neighbor of a high charge value smoker is another high charge value smoker. This result may not hold the same weight in other data sets with more normal data, but it is interesting to see a model calculate an MSE of this magnitude.
newtable <- Table[order(Table$Result),]
newtable$model_name <- factor(newtable$model_name, levels = newtable$model_name)
options(scipen=10)
theme_set(theme_bw())
# Draw plot
ggplot(newtable, aes(x=model_name, y=Result)) +
geom_bar(stat="identity", width=.5, fill="lightcoral") +
labs(title="Ordered Bar Chart of Test MSE") +
theme(axis.text.x = element_text(angle=65, vjust=0.6))+
ylim(0,50000000)
By placing each Test model’s MSE in order, the gap between the KNN model is clear. Even though the model did very well, I doubt that it is very useful since the main objective was to find the optimal prediction model. Since KNN is unsupervised, it isn’t that likely that we can rely on this model to tell us much about the future. Of all of the other models, I would select the Tuned XGBoost model for future analysis. I believe with more rounds of grid searching this model will improve dramatically. It is also interesting to see how the tree-based models all outperformed the regression models. The main problem for these models was determining whether a patient was a smoker. Since the regression line models are based on averages, they had a harder time dealing with the large charges that smokers have. The tree-based models found a way to reduce error by focusing on the smoker variable and creating that split right away. They also helped uncover age and BMI as the other crucial variables in determining health care costs.
Overall, I felt like each model did its job in telling a different story about the data. The regression models were good for highlighting the average effects of different units against y while the tree-based methods took a logical approach to allocating average cost through steps. Smoking is by far the most important factor in one’s health according to this data set, which would be interesting if this assertion held up with more data. Regardless, we know smoking leads to health issues and the practices that deal with respiratory infections might be partly responsible for the rise in the cost of the US health sector. The effect that age has on medical charges also makes sense due to the association with the older population and prescription drug costs. This is also an area that is expected to rise in the coming years with pharmaceutical companies upping charges on medicine and with new breakthroughs in medicine to treat more people. BMI also seems like a variable that is critical to the growth of the medical sector since foods have become more processed and unhealthier. If childhood obesity continues to grow in the country, then we can expect more of the issues associated with being overweight in adults. The service industries rise in cost may also be an effect on the growing rates of weight and obesity. It would be interesting to also look into services like counseling and phycology with the increasing need for mental health wellness amongst the public. The variables didn’t provide much importance to the models, but I still suspect a regional affect to be relevant. However, this effect might be focused more on population density and resources than the given patient’s geographical region. There are more demographics out there which could be used to determine medical costs and this report was a great starting point to see how large the disparities between groups can really be. With health data becoming more readily available to a level of care, I believe we may see large expansion in the industry. This expansion will naturally lead to higher costs but in the long run will benefit the population as a whole with the regression of once pressing medical issues.